What properties should an overall measure of test performance possess?
To the Editor: Obuchowski et al. (1) recently reviewed the uses and misuses of ROC curves and explained, in simple terms, some sophisticated solutions to frequent problems, thereby popularizing tools described elsewhere, notably in the book coauthored by Nancy Obuchowski (2). What prompts me to comment on this careful and timely review are two fundamental questions: What properties should an overall measure of test performance possess? How do we make better tests score better?

ROC curves are indispensable and conceptually straightforward, but selecting a single overall performance measure requires some care. In referring to the "ROC curve and the measures of accuracy derived from it", the introduction to the review acknowledges that there are several choices. However, like many similar texts, it jumps quickly to examination of the area under the ROC curve (AUROC) without pausing to ask whether the AUROC is the right, or the best, measure. This omission leaves parts of the mathematics without the necessary "philosophical" foundation. It may also promote the misconception that an AUROC calculation is the purpose of drawing a ROC curve.

Now, by what criteria should performance measures be selected? The first criterion is that two tests that provide the same information should also score the same. However, if one reorders the test outcomes by, e.g., interchanging "atypical" and "equivocal" on the list from "normal" to "cancer", the geometry, and therefore the area, will change; nevertheless, the clinical import of each test outcome and, hence, the clinical merits of the test remain unchanged. In fact, it is easy to provide an example [see page 31 of Ref. (2)] in which a test classifies all patients correctly and yet has an AUROC of 0.5, suggesting complete lack of discrimination. A logical remedy is to assume the most favorable ordering when calculating the AUROC (3). Briefly, that means ordering the test outcomes by decreasing likelihood ratio, from ominous to reassuring. It means reassembling the segments of the ROC curve to make it concave. However, the review by Obuchowski et al. (1) does not address concavity issues.

Insistence on concavity is not enough: even when tests have concave ROC curves, their AUROCs may fail to rank them correctly. My oldest example (3) involves a clinical context where, by virtue of suitable symmetries, two tests are demonstrably equally useful yet have AUROCs as different as 0.85 and 0.90. Although the report giving this example is cited by Obuchowski et al. (1) in other contexts, their review does not address this potential shortcoming of the AUROC.

Another compelling criterion is this: if a test can be emulated by superimposing pure noise, such as independent measurement error, on another test, it cannot be more informative than the latter and should never score higher; the same applies when test readings are binarized or otherwise coarsened.

Once one acknowledges that the AUROC may mislead—which is my main point—it would be gratifying to be able to propose fail-safe ROC summary statistics. Unfortunately, there is a theoretical reason that any alternative statistics will have similar shortcomings: no performance measure based solely on the ROC curve and ignoring the pretest disease probability can rank competing tests without sometimes violating the expected utility principle of decision theory and, hence, clinical rationality (4).
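To make the likelihood-ratio remedy concrete, here is a minimal Python sketch. The four-category counts are invented for illustration; they are not taken from the letter or from Ref. (2). It computes the trapezoidal AUROC of an ordinal test, then reorders the categories by likelihood ratio, which concavifies the ROC curve and can only increase the area.

```python
import numpy as np

# Hypothetical counts of diseased / nondiseased patients in four ordinal
# categories ("normal", "atypical", "equivocal", "cancer").
diseased    = np.array([40, 5, 15, 40])
nondiseased = np.array([60, 25, 10, 5])

def auroc(d, n):
    """Trapezoidal area under the ROC curve obtained by cumulating
    categories from the last (most ominous) to the first."""
    tpr = np.concatenate(([0.0], np.cumsum(d[::-1]) / d.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(n[::-1]) / n.sum()))
    return np.sum(np.diff(fpr) * (tpr[:-1] + tpr[1:]) / 2)

print(f"AUROC, ordering as listed: {auroc(diseased, nondiseased):.3f}")

# The "most favorable ordering": sort categories by likelihood ratio
# P(category | disease) / P(category | no disease), so that the ROC
# segments appear in order of decreasing slope, i.e., the curve is concave.
lr = (diseased / diseased.sum()) / (nondiseased / nondiseased.sum())
order = np.argsort(lr)  # ascending; cumulating from the end gives descending LR
print(f"AUROC, likelihood-ratio ordering: "
      f"{auroc(diseased[order], nondiseased[order]):.3f}")
```

With these invented counts the first call gives about 0.68 and the concavified version about 0.75; swapping the two middle categories in the first call changes the area even though the test's information content is unchanged, which is exactly the first criterion's complaint.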
Performance measures must take pretest probabilities into account and be based on assessments of utility, i.e., clinical benefit and loss, or at least on a provisional utility model (pseudo-regret function) (3, 4). Tests are then ranked by the expected pretest–posttest difference in utility. For example, the simple quadratic (Brier) pseudo-regret function is J(r) = 4r(1 − r), where r is a disease (vs no disease) probability; J(r) = 1 when r = 0.5, i.e., when the diagnosis is maximally uncertain, and falls off toward 0 as r approaches 0 or 1, i.e., diagnostic certainty. The expected pseudo-regret drops from J(p), p being the pretest disease probability, to the expectation of J evaluated at the posttest disease probability, taken over the possible test results.
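A numeric sketch of this ranking follows; the pretest probability and the result likelihoods below are invented for illustration only.

```python
import numpy as np

def J(r):
    """Quadratic (Brier) pseudo-regret: 1 at r = 0.5, 0 at certainty."""
    return 4.0 * r * (1.0 - r)

p = 0.30                               # assumed pretest disease probability
lik_d  = np.array([0.05, 0.15, 0.80])  # P(result | disease)    -- hypothetical
lik_nd = np.array([0.70, 0.20, 0.10])  # P(result | no disease) -- hypothetical

marginal  = p * lik_d + (1 - p) * lik_nd  # P(result)
posterior = p * lik_d / marginal          # P(disease | result), Bayes' rule

expected_post = np.sum(marginal * J(posterior))
print(f"pretest pseudo-regret J(p):       {J(p):.3f}")
print(f"expected posttest pseudo-regret:  {expected_post:.3f}")
print(f"expected gain (ranking criterion): {J(p) - expected_post:.3f}")
```

Under this quadratic model the gain J(p) − E[J(posterior)] equals 4·Var(posterior), because the posterior probabilities average to p over the test results; a test is valuable exactly insofar as it spreads the posttest probabilities away from the pretest value.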
Journal: Clinical Chemistry, Volume 51, Issue 2, 2005.